Checkpointing in Parallel State-Machine Replication

Authors

Odorico Machado Mendizabal

Parisa Jalili Marandi

Fernando Luís Dotti

Fernando Pedone

Abstract

State-machine replication is a popular approach to building fault-tolerant systems. It relies on the sequential execution of commands to guarantee strong consistency. Sequential execution, however, threatens performance. Recently, several proposals have suggested parallelizing the execution model of the replicas to enhance state-machine replication’s performance. Despite their success in achieving high performance, the implications of these models for checkpointing and recovery are mostly left unaddressed. In this paper, we focus on the checkpointing problem in the context of Parallel State-Machine Replication. We propose two novel algorithms and assess them through simulation and a real implementation.

Similar resources

On the Efficiency of Durable State Machine Replication

State Machine Replication (SMR) is a fundamental technique for ensuring the dependability of critical services in modern internet-scale infrastructures. SMR alone does not protect from full crashes, and thus in practice it is employed together with secondary storage to ensure the durability of the data managed by these services. In this work we show that the classical durability enforcing mecha...

A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications

As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require a central storage for storing checkpoints. This severely limits the scalability of checkpointing. We propose a scalable replication-based MPI checkpointing facility that is based on LAM/MPI. We extend...

A Comprehensive User-level Checkpointing Strategy for MPI Applications

As computational clusters increase in size, their mean-time-to-failure reduces drastically. After a failure, most MPI checkpointing solutions require a restart with the same number of nodes. This necessitates the availability of multiple spare nodes, leading to poor resource utilization. Moreover, most techniques require a central storage for storing checkpoints. This results in a bottleneck an...

A Novel Replication Technique for Implementing Fault-Tolerant Parallel Software

In this paper we present a novel replication technique based on the FTAG computation model. FTAG is a functional and attribute-based language for programming fault-tolerant parallel applications [4]. FTAG has a tree-structured computation model. In the replication technique developed, an application is replicated on different groups of processors. Each group is called a replica. All replicas are a...

Efficient, Language-based Checkpointing for Massively Parallel Programs

Checkpointing and restart is an approach to ensuring forward progress of a program in spite of system failures or planned interruptions. We investigate issues in checkpointing and restart of programs running on massively parallel computers. We identify a new set of issues that have to be considered for the MPP platform, based on which we have designed an approach based on the language and run-t...

Publication date: 2014
